3 Overview

The data mapping can be defined with the Rabbit-In-A-Hat tool, which takes as its starting point a profile of the database produced by the White Rabbit tool. RH4131 was unable to run White Rabbit on the cloud node. Instead, edenceHealth provided a format for the scan reports, and RH4131 used tailored scripts to generate scan reports in that format. edenceHealth was then able to create White Rabbit-style scan reports from the provided scan reports, in effect reverse-engineering conventional White Rabbit scan reports. Appendix B presents a data dictionary for all the tables and fields that have been profiled.

3.1 Source data

Rigshospitalet provided Parquet files containing ICU data for two of the three expected sites (RH4131, Hvidovre, and Odense). Each Parquet file holds data for one of the following six tables:

  • prescriptions

  • administrations

  • diagnoses_procedures

  • observations (actually contained in multiple files, named observations-*.parquet)

  • course_metadata [1]

  • t_person [2]
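Because observations is spread over several files, it can help to group the delivered files by table before loading them. The following is a minimal sketch of that grouping. Only the observations-*.parquet pattern comes from the text above; the other file names and the helper group_files_by_table are assumptions made for illustration.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

# Table names as listed above. Apart from observations-*.parquet,
# the file-name patterns are assumptions, not taken from the source data.
TABLE_PATTERNS = {
    "prescriptions": "prescriptions.parquet",
    "administrations": "administrations.parquet",
    "diagnoses_procedures": "diagnoses_procedures.parquet",
    "observations": "observations-*.parquet",
    "course_metadata": "course_metadata.parquet",
    "t_person": "t_person.parquet",
}


def group_files_by_table(file_names):
    """Return {table: sorted file names}; unmatched names are keyed by None."""
    grouped = defaultdict(list)
    for name in file_names:
        # Pick the first table whose pattern matches this file name.
        table = next(
            (t for t, pattern in TABLE_PATTERNS.items() if fnmatchcase(name, pattern)),
            None,
        )
        grouped[table].append(name)
    return {table: sorted(names) for table, names in grouped.items()}


# Hypothetical file listing, for illustration only.
example_files = [
    "observations-2.parquet",
    "t_person.parquet",
    "observations-1.parquet",
    "readme.txt",
]
grouped_files = group_files_by_table(example_files)
```

Files that match no known pattern are collected under the key None, so unexpected deliveries are visible rather than silently dropped.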

Initially, RH4131 provided three scan report-like files for each site:

  • database_scan: contains the number of rows for each table

  • table_scan: describes the columns contained within each table, including data type, uniqueness, and missing values

  • field_scan: contains the data for each column within each table

The data files are therefore somewhat nested. database_scan and table_scan contain the information usually found in the first two sheets of a scan report. field_scan contains the data usually found in the subsequent sheets (one per table) of a scan report; here, however, it is all contained in a single table, field_scan.
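Splitting the single field_scan table back into one sheet per source table could look roughly as follows. This is a sketch only: the column names table_name, field_name, value, and frequency are assumptions (the actual columns are listed in Appendix B), the rows are made up, and it is not the script that was actually used.

```python
from collections import defaultdict


def split_field_scan(rows):
    """Group flat field_scan rows into {table: {field: [(value, frequency), ...]}}.

    Each row is a dict with the assumed keys
    table_name, field_name, value, and frequency.
    """
    sheets = defaultdict(lambda: defaultdict(list))
    for row in rows:
        sheets[row["table_name"]][row["field_name"]].append(
            (row["value"], int(row["frequency"]))
        )
    # Within each field, list the most frequent values first.
    return {
        table: {
            field: sorted(pairs, key=lambda pair: -pair[1])
            for field, pairs in fields.items()
        }
        for table, fields in sheets.items()
    }


# Made-up rows, for illustration only.
example_rows = [
    {"table_name": "prescriptions", "field_name": "route", "value": "iv", "frequency": "12"},
    {"table_name": "prescriptions", "field_name": "route", "value": "oral", "frequency": "30"},
    {"table_name": "t_person", "field_name": "sex", "value": "F", "frequency": "5"},
]
per_table_sheets = split_field_scan(example_rows)
```

Each top-level key then corresponds to one per-table sheet, with the values of each field ordered by descending frequency.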

In addition, RH4131 has provided the following three files:

  • shak_lookup.tsv: tab-separated file with SHAK codes and care-site metadata such as postal code and official name. This will be used during the ETL.

  • drug_mapping_helper.tsv: tab-separated file with prescription data (including ATC, dose, dose unit, route, drug names) to be used before the ETL to populate the STEM table.

  • course_id_cpr_mapping.txt: tab-separated file with three columns:

    • courseid: the visit identifier

    • timestamp: irrelevant for the purpose of the ETL

    • cpr_enc: the encrypted personal identifier
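As an illustration of how course_id_cpr_mapping.txt might be consumed, the sketch below builds a lookup from courseid to cpr_enc and ignores the timestamp column. It assumes the file has a header row with exactly the three column names above; the sample content and the helper load_course_to_cpr are invented for the example.

```python
import csv
import io


def load_course_to_cpr(handle):
    """Map courseid -> cpr_enc from a tab-separated file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    # The timestamp column is read but deliberately not used.
    return {row["courseid"]: row["cpr_enc"] for row in reader}


# Made-up sample content, for illustration only; real identifiers differ.
sample = io.StringIO(
    "courseid\ttimestamp\tcpr_enc\n"
    "1001\t2020-01-01 08:00:00\tabc123\n"
    "1002\t2020-01-02 09:30:00\tabc123\n"
    "1003\t2020-01-03 10:15:00\tdef456\n"
)
course_to_cpr = load_course_to_cpr(sample)
```

With a real file, the same function could be called on a handle opened with open(path, newline="", encoding="utf-8"); the encoding is likewise an assumption.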

The exact columns included in each file are listed in Appendix B.


  1. Visits are called courses in the source data, from the Danish term forløb.

  2. This table comes from the Danish Civil Registration System and holds data such as date of birth and sex (“CPR-Registeret - Sundhedsdatastyrelsen,” n.d.).